
11 research outputs found

GoFFish: A Sub-Graph Centric Framework for Large-Scale Graph Analytics

Author: Kumbhare Alok
Nagarkar Soonil
Prasanna Viktor
Raghavendra Cauligi
Ravi Santosh
Simmhan Yogesh
Wickramaarachchi Charith
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 22/11/2013

Large-scale graph processing is a major research area for Big Data exploration. Vertex-centric programming models like Pregel are gaining traction because their simple abstraction naturally allows for scalable execution on distributed systems. However, this approach has limitations that cause vertex-centric algorithms to under-perform: a poor ratio of compute to communication overhead and slow convergence of iterative supersteps. In this paper we introduce GoFFish, a scalable sub-graph centric framework co-designed with a distributed persistent graph storage for large-scale graph analytics on commodity clusters. We introduce a sub-graph centric programming abstraction that combines the scalability of a vertex-centric approach with the flexibility of shared-memory sub-graph computation. We map Connected Components, SSSP and PageRank algorithms to this model to illustrate its flexibility. Further, we empirically analyze GoFFish using several real-world graphs and demonstrate its significant performance improvement, orders of magnitude in some cases, compared to Apache Giraph, the leading open-source vertex-centric implementation. Comment: Under review by a conference, 201
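The sub-graph centric idea in this abstract can be illustrated with a small single-process sketch of Connected Components. This is not the GoFFish API: the function names, the list-of-sets partition format, and the in-memory "inbox" are all assumptions made for illustration. Each partition first resolves its local sub-graphs with an ordinary shared-memory traversal, and only then do sub-graphs exchange labels over remote edges in BSP-style supersteps.

```python
from collections import defaultdict, deque


def local_subgraphs(vertices, local_edges):
    """Find connected sub-graphs inside one partition using BFS over local edges only."""
    adj = defaultdict(set)
    for u, v in local_edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, subgraphs = set(), []
    for start in sorted(vertices):
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        subgraphs.append(frozenset(comp))
    return subgraphs


def connected_components(partitions, edges):
    """Hypothetical helper: partitions is a list of vertex sets, edges a list of undirected pairs.

    Returns (vertex -> component label, number of supersteps executed).
    """
    owner = {v: i for i, part in enumerate(partitions) for v in part}

    # Superstep 0: each partition labels its sub-graphs locally, with no messages.
    sub_of, label = {}, {}
    for i, part in enumerate(partitions):
        local = [(u, v) for u, v in edges if owner[u] == i and owner[v] == i]
        for sg in local_subgraphs(part, local):
            sg_id = min(sg)
            label[sg_id] = sg_id
            for v in sg:
                sub_of[v] = sg_id

    # Remote edges connect sub-graphs that live in different partitions.
    remote = [(sub_of[u], sub_of[v]) for u, v in edges if owner[u] != owner[v]]

    supersteps, changed = 1, True
    while changed:
        # Build all messages from the labels as they stood at the start of the superstep.
        inbox = defaultdict(list)
        for a, b in remote:
            inbox[b].append(label[a])
            inbox[a].append(label[b])
        changed = False
        for sg_id, msgs in inbox.items():
            best = min(msgs)
            if best < label[sg_id]:
                label[sg_id] = best
                changed = True
        supersteps += 1
    return {v: label[sub_of[v]] for v in sub_of}, supersteps
```

In this sketch, messages cross only remote edges, so the number of supersteps depends on how many sub-graphs a component spans rather than on its vertex-level diameter, which is the intuition behind the convergence argument in the abstract.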

arXiv.org e-Print Archive

Crossref

PLAStiCC: Predictive Look-Ahead Scheduling for Continuous dataflows on Clouds

Author: Kumbhare Alok Gautam
Prasanna Viktor K
Simmhan Yogesh
Publication venue: IEEE

Scalable stream processing and continuous dataflow systems are gaining traction with the rise of big data due to the need for processing high-velocity data in near real time. Unlike batch processing systems such as MapReduce and workflows, static scheduling strategies fall short for continuous dataflows due to the variations in the input data rates and the need for sustained throughput. The elastic resource provisioning of cloud infrastructure is valuable to meet the changing resource needs of such continuous applications. However, multi-tenant cloud resources introduce yet another dimension of performance variability that impacts the application's throughput. In this paper we propose PLAStiCC, an adaptive scheduling algorithm that balances resource cost and application throughput using a prediction-based look-ahead approach. It addresses variations not only in the input data rates but also in the underlying cloud infrastructure. In addition, we propose several simpler static scheduling heuristics that operate in the absence of an accurate performance prediction model. These static and adaptive heuristics are evaluated through extensive simulations using performance traces obtained from the Amazon AWS IaaS public cloud. Our results show an improvement of up to 20% in the overall profit as compared to the reactive adaptation algorithm.

Open Access Repository of IISc Research Publications

Fault-Tolerant and Elastic Streaming MapReduce with Decentralized Coordination

Author: Frincu Marc
Kumbhare Alok
Prasanna Viktor K
Simmhan Yogesh

The MapReduce programming model, due to its simplicity and scalability, has become an essential tool for processing large data volumes in distributed environments. Recent Stream Processing Systems (SPS) extend this model to provide low-latency analysis of high-velocity continuous data streams. However, integrating MapReduce with streaming poses challenges: first, runtime variations in data characteristics such as data rates and key distribution cause resource overload, which in turn leads to fluctuations in the Quality of Service (QoS); and second, stateful reducers, whose state depends on the complete tuple history, necessitate efficient fault-recovery mechanisms to maintain the desired QoS in the presence of resource failures. We propose an integrated streaming MapReduce architecture that leverages consistent hashing to support runtime elasticity, along with locality-aware data and state replication, to provide efficient load balancing with low-overhead fault tolerance and parallel fault recovery from multiple simultaneous failures. Our evaluation on a private cloud shows up to 2.8x improvement in peak throughput compared to Apache Storm SPS, and a low recovery latency of 700 to 1500 ms from multiple failures.
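To make the role of consistent hashing in runtime elasticity concrete, here is a minimal, generic hash-ring sketch. It is not the architecture from the paper: the `HashRing` class, the reducer names, the MD5-based hash and the choice of 64 virtual points per node are all assumptions for illustration. It shows the property that matters for scaling reducers in and out: adding a node moves only the keys that the new node takes over.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Generic consistent-hash ring with virtual points per node (illustrative only)."""

    def __init__(self, replicas: int = 64):
        self.replicas = replicas
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name

    def add_node(self, node: str) -> None:
        for i in range(self.replicas):
            p = _hash(f"{node}#{i}")
            if p in self._owner:  # skip the (extremely unlikely) position collision
                continue
            bisect.insort(self._points, p)
            self._owner[p] = node

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring position clockwise from the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


# Usage: scale out from three hypothetical reducers to four and see which keys move.
ring = HashRing()
for name in ("reducer-0", "reducer-1", "reducer-2"):
    ring.add_node(name)
keys = [f"key-{i}" for i in range(200)]
before = {k: ring.lookup(k) for k in keys}
ring.add_node("reducer-3")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to `reducer-3`, and all other keys stay where they were, so only the state for the moved keys would need to be transferred when a reducer is added.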

Crossref

Open Access Repository of IISc Research Publications

Distributed Programming over Time-series Graphs

Author: Choudhury Neel
Frincu Marc
Kumbhare Alok
Prasanna Viktor
Raghavendra Cauligi
Simmhan Yogesh
Wickramaarachchi Charith

Graphs are a key form of Big Data, and performing scalable analytics over them is invaluable to many domains. There is an emerging class of inter-connected data that accumulates or varies over time, and on which novel algorithms over both the network structure and the time-variant attribute values are necessary. We formalize the notion of time-series graphs and propose a Temporally Iterative BSP programming abstraction to develop algorithms on such datasets using several design patterns. Our abstractions leverage a sub-graph centric programming model and extend it to the temporal dimension. We present three time-series graph algorithms based on these design patterns and abstractions, and analyze their performance using the GoFFish distributed platform on Amazon AWS Cloud. Our results demonstrate the efficacy of the abstractions for developing practical time-series graph algorithms and scaling them on commodity hardware.

Crossref

Open Access Repository of IISc Research Publications